ABSTRACT

Applications of the EM algorithm to the estimation of Bayesian
hyperparameters are discussed and reviewed in the context of the
author's philosophy involving the inductive and pragmatic modelling of
sampling distributions and prior structures. Frequently the
hyperparameters may be estimated from the data, thus avoiding the
subjective assessment of these values. The ideas are applied to multiple
regression models, histograms and multinomial distributions. A numerical
example is described in the context of smoothing the cell probabilities
of several multinomial distributions.
APPLICATIONS OF THE EM ALGORITHM
TO THE ESTIMATION OF BAYESIAN HYPERPARAMETERS

Tom Leonard

1. Background and Statistical Ideas.

The resurgence in Bayesian statistics over the past fifteen or so years
has been due to its recognition both as a methodology based upon pure
probability theory, and hence free from theoretical counterexample, and as
an approach to scientific investigation which assists the deductive,
inductive, and pragmatic reasoning of the statistician in a meaningful and
illuminating way. In his writings Professor Bruno De Finetti has highlighted
three fundamental concepts. These are Coherence, Scientific Induction, and
Exchangeability. All these are related to the very necessary iteration
between inference about parameters, conditional on the truth of the sampling
model (coherence, deduction), and the process of scientific modelling
(inductive and pragmatic reasoning together with more formal theoretical
procedures). For accounts of scientific modelling in a Bayesian context see
Box (1980) and Leonard (1978, 1981, 1982).

Many traditional Bayesians (e.g. Lindley et al., 1978) have pursued the
concept of coherence on its own and have concentrated their energies on
trying to extract coherent subjective distributions from the scientist. Less
attention has been paid to the statistical problem of gaining insight from
data sets in relation to their scientific background. Whilst prior
distributions are very useful mathematical and philosophical constructs for
generating meaningful statistical procedures, it is unlikely that all the
useful knowledge possessed by a scientific expert will be representable in
the form of a probability distribution. An expert's information is usually
more complex and diverse; it is usually necessary to extract this information
via an interaction between the statistician, the expert, the statistical
methodology, the data, and the scientific background. Pragmatic and inductive
judgements seem more appropriate than trying to extend coherence for
inference about parameters, given the model, to coherence of the whole
statistical procedure. I see Bayesian statisticians as possessing tremendous
advantages in the sense that, when treated pragmatically, their methodologies
and philosophies can be used to advise the investigator how to handle his
down-to-earth practical problems involving scientific data, in a very
realistic and doubtlessly superior manner.

Suppose that the statistician is concerned with a $p \times 1$ vector
$\beta$ of parameters of interest; let $x$ represent his observations. To be
able to make a Bayesian inference about $\beta$ he is generally faced with
three complicated practical problems. These are

(1) Modelling the form of the sampling distribution $p(x \mid \beta, \xi)$
of $x$ given $\beta$ and (perhaps) further parameters $\xi$.

(2) Modelling the prior structure, i.e. the form of the prior distribution
$\pi(\beta \mid \lambda)$ of $\beta$ given some hyperparameters $\lambda$.

(3) Ascertaining suitable values for the hyperparameters $\lambda$.

Of these, (1) is particularly important when the elements of $x$ are
continuous measurements; for categorical data Poisson or multinomial
assumptions often suffice. The Bayesian nonparametric estimation of sampling
densities is, for example, discussed by Leonard (1973 and 1978), and Atilgan
and Leonard (1981). The Bayesian modelling of sampling distributions seems to
me to provide one of the most important future directions for our subject;
considerably more work is needed in this area. This aspect will be discussed
further in Section 4.


The modelling (2) of prior structures is equally important, and more
difficult because the data are not so immediately relevant. It seems
necessary to avoid the obvious retreat to conjugacy, except, perhaps, for the
linear statistical model (conjugate priors only exist for a restrictive class
of sampling models and in non-normal cases they possess quite restrictive
covariance structures). In fact, De Finetti's concept of exchangeability
provides us with one way of thinking about prior structures. We could

(a) Seek a suitable $p \times 1$ vector $\gamma$ of transformations of the
elements of $\beta$ such that we are prepared to take the elements of
$\gamma$ to be a priori exchangeable.

(b) Model a suitable exchangeable distribution for the elements of
$\gamma$, e.g. by employing the two-stage structure recommended by De
Finetti's theorem, or a first-stage prior plus empirical procedures at the
second stage.

These ideas will be discussed, for particular examples, in later sections.
Both (a) and (b) should in general be based upon such aspects as pragmatic
judgement in relation to the scientific background, intuition about how
substantively particular assumptions are likely to affect the posterior
conclusions, and judgements about how reasonable the posterior estimates are
in statistical terms (e.g. there is a conceptual duality between estimation
procedures and the underlying prior assumptions). Both the exchangeable
distribution and the transformations may be taken to depend upon the
hyperparameters $\lambda$.

The statistician is finally faced with problem (3), i.e. ascertaining
appropriate values for the hyperparameters $\lambda$. When the dimension of
$\lambda$ is small compared with that of $\beta$ the data will frequently
possess considerable information about $\lambda$. The information about
$\xi$ and $\lambda$ is summarized by their marginal likelihood

$$ p(x \mid \xi, \lambda) = \int p(x \mid \beta, \xi)\, \pi(\beta \mid \lambda)\, d\beta . \tag{1.1} $$
We see that, rather than pursuing the unenviable task of extracting the
subjective beliefs from a scientist, it is possible to use the marginal
likelihood to simply estimate $\lambda$ from the data. A complicated way of
doing this is to assign a hyperprior to $\lambda$, yielding a hierarchical
(two-stage) prior for $\beta$ (see, for example, Lindley and Smith, 1972) and
the obvious hyperposterior for $\lambda$ by multiplication with the marginal
likelihood in (1.1). However, whilst uninformative hyperpriors can be useful
(e.g. Leonard, 1976), it would be difficult for the applied worker to model
the structure of informative hyperpriors. Also, formal Bayesian procedures
for estimating $\beta$ unconditionally on $\lambda$ typically become very
tedious computationally.

We will instead estimate $\xi$ and $\lambda$ by jointly maximizing the
marginal likelihood in (1.1). The EM algorithm provides us with an excellent
computational scheme for doing this in a very wide range of situations.
Estimates for $\beta$ will be proposed which provide good approximations to
formal hierarchical Bayes estimates whenever the marginal likelihood is
moderately informative.

In situations where the objective of the analysis is to gain insight from
a data set, the idea of empirically estimating the hyperparameters, rather
than specifying them a priori, seems appealing. I think that a coherent
Bayesian would simply be attempting an impossible task if he tried to
construct an entire multivariate distribution just based upon subjective
information. However, if we think in terms of the specification of a
meaningful prior structure, with any spare hyperparameters estimated
empirically from the data, then the procedure will often make more
statistical sense.
2. The EM Algorithm.

It is possible to apply the algorithm developed by Dempster et al. to the
estimation of $\xi$ and $\lambda$ by the maximum of the marginal likelihood
in (1.1). In the present context this involves the following two steps.

Expectation Step (E-step). Using the latest vectors $\xi^*$ and
$\lambda^*$ for $\xi$ and $\lambda$, calculate the expectation

$$ G(\xi, \lambda \mid \xi^*, \lambda^*) = E[\log p(x \mid \beta, \xi) + \log \pi(\beta \mid \lambda)] \tag{2.1} $$

where the expectation should be taken with respect to the posterior
distribution of $\beta$, given $\xi = \xi^*$ and $\lambda = \lambda^*$.

Maximization Step (M-step). Obtain new values for $\xi$ and $\lambda$ by
maximizing the expectation $G(\xi, \lambda \mid \xi^*, \lambda^*)$ obtained
at the E-step. Return to the E-step and keep cycling until convergence.

In a University of Wisconsin 1981 technical report, C. F. Wu was the
first to give general conditions under which this procedure converges to the
maximum of the marginal likelihood (1.1). The above procedure for
hyperparameters has been employed in particular cases by Laird (1978) and
Chen (1980).
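
As a minimal sketch of the two steps, consider the simplest exchangeable
set-up, which is an illustrative assumption and not one of the examples
treated in this paper: scalar observations $x_i \mid \beta_i \sim N(\beta_i,
s^2)$ with $s^2$ known, and $\beta_i \mid \mu, \tau^2 \sim N(\mu, \tau^2)$,
so that $\lambda = (\mu, \tau^2)$ and there are no further parameters $\xi$.
The function name `em_normal_means`, the starting values and the data are
hypothetical.

```python
def em_normal_means(x, s2, iters=200):
    """EM for lambda = (mu, tau2) when x_i | b_i ~ N(b_i, s2), b_i ~ N(mu, tau2).

    Toy model assumed for illustration only; s2 is treated as known.
    """
    m = len(x)
    mu, tau2 = 0.0, 1.0  # arbitrary starting values
    for _ in range(iters):
        # E-step: posterior of each b_i given the current (mu, tau2)
        # is normal with common variance d and mean b_i below.
        d = 1.0 / (1.0 / s2 + 1.0 / tau2)
        b = [d * (xi / s2 + mu / tau2) for xi in x]
        # M-step: maximize the expected log prior over (mu, tau2).
        mu = sum(b) / m
        tau2 = sum((bi - mu) ** 2 for bi in b) / m + d
    return mu, tau2


x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
mu_hat, tau2_hat = em_normal_means(x, s2=1.0)
print(mu_hat, tau2_hat)
```

For this toy model the marginal distribution of each $x_i$ is $N(\mu, s^2 +
\tau^2)$, so the marginal likelihood can also be maximized in closed form:
$\mu$ is the sample mean and $\tau^2$ is the mean squared deviation minus
$s^2$, whenever that is positive. For the data above this gives $\mu = 1$ and
$\tau^2 = 3$, which provides a check on the iteration.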


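Example: Multiple Regression Models.

Consider firstly the estimation of the parameters of several multiple
regressions (e.g. Lindley and Smith, 1972; Smith, 1973). Suppose we observe
vectors $y_1, \ldots, y_m$ of respective dimensions $n_1, \ldots, n_m$, where

$$ y_i \mid \beta_i, \sigma_i^2 \sim N(X_i \beta_i, \sigma_i^2 I_{n_i}) \quad (i = 1, \ldots, m) $$

and that $\beta_1, \ldots, \beta_m$ may be viewed as exchangeable with

$$ \beta_i \mid \mu, C \sim IN(\mu, C). $$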

This assumption will frequently be appropriate when we possess a symmetry
of prior ignorance about the $\beta_i$. It greatly assists us in modelling
the prior structure, since we need now just concern ourselves with the
within-regression covariance structure represented by $C$. This however may
be simply estimated from the data, so that no further prior modelling is
required. When $\mu$, $C$ and $\sigma^2 = (\sigma_1^2, \ldots,
\sigma_m^2)^T$ are known, the posterior distribution of the $\beta_i$ is

$$ \beta_i \mid y, \mu, C, \sigma^2 \sim IN(\beta_i^*, D_i) $$

where

$$ \beta_i^* = (\sigma_i^{-2} X_i^T X_i + C^{-1})^{-1} (\sigma_i^{-2} X_i^T y_i + C^{-1} \mu) \quad (i = 1, \ldots, m) \tag{2.1} $$

and

$$ D_i^{-1} = \sigma_i^{-2} X_i^T X_i + C^{-1} . \tag{2.2} $$

Rather than bringing in a set of complicated prior assumptions for $\mu$,
$C$, and $\sigma^2$ it is much more straightforward to simply empirically
estimate these quantities via their marginal likelihood. Combination of the
E-step and M-step in Section 2 tells us that the marginal maximum likelihood
estimates satisfy the equations

$$ \mu = \bar{\beta}^* = m^{-1} \sum_{i=1}^{m} \beta_i^* , \tag{2.3} $$

$$ C = m^{-1} \sum_{i=1}^{m} (\beta_i^* - \bar{\beta}^*)(\beta_i^* - \bar{\beta}^*)^T + m^{-1} \sum_{i=1}^{m} D_i , \tag{2.4} $$

and

$$ \sigma_i^2 = n_i^{-1} (y_i - X_i \beta_i^*)^T (y_i - X_i \beta_i^*) + n_i^{-1} \operatorname{trace}(X_i D_i X_i^T) \quad (i = 1, \ldots, m), \tag{2.5} $$

where $\beta_i^*$ and $D_i$ satisfy (2.1) and (2.2).


Moreover, convergence is guaranteed if we substitute trial values for
$\mu$, $C$ and $\sigma^2$ into the right hand sides of (2.1) and (2.2), and
put the values obtained for $\beta_i^*$ and $D_i$ into the right hand sides
of (2.3) - (2.5), then return to (2.1) and (2.2) and repeat the procedure
until convergence.

In the prior ignorance case, the procedure proposed by Smith reduces to
the above equations but with the important second terms in the right hand
sides of (2.4) and (2.5) omitted. Hence overshrinkages towards
$\bar{\beta}^*$ were observed in his numerical example. With the extra terms
included large shrinkages may still occur, but only when the data suggest
that this is reasonable.

Let us now turn to the single multiple regression situation

$$ y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n) $$

and suppose that the columns of $X$ relate to $p$ different sets of observed
explanatory variables. We now make the (obvious) claim that no general
purpose non-uniform prior structure exists, and that meaningful prior
structures can only be developed if we utilize background information
concerning the nature of the explanatory variables, or informative prior
knowledge about $\beta$. In the absence of such information we should simply
estimate $\beta$ by least squares.

In particular, the exchangeable prior

$$ \beta \mid \tau^2 \sim N(0, \tau^2 I_p) $$

has been proposed as a means of justifying the ridge estimator

$$ \beta^* = (X^T X + k I_p)^{-1} X^T y \tag{2.6} $$

since the posterior mean of $\beta$ may be obtained by setting

$$ k = \sigma^2 / \tau^2 \tag{2.7} $$

in (2.6).

However, neither exchangeability nor ridge regression seems appropriate in
prior ignorance situations. Any non-trivial linear transformation of the $X$
matrix focuses attention on a new set of parameters which are not
exchangeable if the elements of $\beta$ are exchangeable. In prior ignorance
situations there is no way of discerning to which form of the model the ridge
estimator should be applied. It therefore does not make any sense to adjust
any set of parameter estimates towards zero, or towards any other origin; and
the least squares vector possesses greater statistical viability.


If some way could be found of justifying the estimator in (2.6) then it
would of course be easy to estimate $\sigma^2$ and $\tau^2$, and hence $k$,
by the EM algorithm. The equations are

$$ \sigma^2 = n^{-1} (y - X\beta^*)^T (y - X\beta^*) + n^{-1} \operatorname{trace}(X D X^T) \tag{2.8} $$

and

$$ \tau^2 = p^{-1} \beta^{*T} \beta^* + p^{-1} \operatorname{trace}(D) \tag{2.9} $$

where

$$ D^{-1} = \sigma^{-2} X^T X + \tau^{-2} I_p . \tag{2.10} $$

It is possible to show that the solutions to these equations are
noticeably more conservative in shrinking towards zero than is the ridge
trace method of Hoerl and Kennard (1971).
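
The cyclic substitution between (2.6), (2.7) and (2.8) - (2.10) can be
sketched as follows. To keep the sketch free of matrix inversion it assumes
that the columns of $X$ are orthogonal, so that $X^T X$ and $D$ are diagonal;
this restriction, the function name `ridge_em`, the starting values and the
data are assumptions made for illustration, not part of the paper.

```python
def ridge_em(X, y, iters=500):
    """EM-type iteration for sigma2, tau2 and k = sigma2/tau2 in (2.6)-(2.10).

    Sketch under the assumption that the columns of X are orthogonal,
    so X'X = diag(d) and D in (2.10) is diagonal.
    """
    n, p = len(X), len(X[0])
    for j in range(p):
        for k in range(j + 1, p):
            if abs(sum(r[j] * r[k] for r in X)) > 1e-9:
                raise ValueError("columns of X must be orthogonal in this sketch")
    d = [sum(r[j] ** 2 for r in X) for j in range(p)]               # diag of X'X
    z = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]   # X'y
    yty = sum(yi * yi for yi in y)
    sigma2, tau2 = 1.0, 1.0  # arbitrary starting values
    for _ in range(iters):
        D = [1.0 / (dj / sigma2 + 1.0 / tau2) for dj in d]          # (2.10)
        b = [Dj * zj / sigma2 for Dj, zj in zip(D, z)]              # (2.6) with (2.7)
        rss = (yty - 2.0 * sum(bj * zj for bj, zj in zip(b, z))
               + sum(dj * bj * bj for dj, bj in zip(d, b)))
        sigma2 = (rss + sum(dj * Dj for dj, Dj in zip(d, D))) / n   # (2.8)
        tau2 = (sum(bj * bj for bj in b) + sum(D)) / p              # (2.9)
    k = sigma2 / tau2
    beta = [zj / (dj + k) for zj, dj in zip(z, d)]                  # ridge estimate
    return beta, sigma2, tau2, k


# Hypothetical data: a 2^3 factorial design with orthogonal columns.
X = [[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1],
     [-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1]]
y = [3.1, 2.2, 1.0, 0.3, -0.2, -1.1, -2.4, -2.9]
beta, sigma2, tau2, k = ridge_em(X, y)
print(beta, k)
```

With orthogonal columns each ridge coefficient is the least squares
coefficient multiplied by $d_j/(d_j + k)$, where $d_j$ is the $j$th diagonal
element of $X^T X$, so the amount of shrinkage can be read off directly from
the estimated $k$.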

However, it is particularly important, in this single multiple regression
situation, to base any deviation from the least squares vector upon definite
prior knowledge.
3. Smoothing the Probabilities in a Histogram.

Consider a grouped histogram concentrated on a bounded interval $(a,b)$,
and with $s$ cells $I_1, I_2, \ldots, I_s$ of equal width. Let $\theta_1,
\ldots, \theta_s$ sum to unity and denote the cell probabilities, taken in
order, and let $x_1, \ldots, x_s$, summing to $n$, denote the cell
frequencies. We let the $x$'s satisfy the usual multinomial assumptions given
the $\theta$'s, and provide a number of refinements and improvements to the
method proposed by Leonard (1973) for obtaining smooth estimates of the
$\theta$'s.

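It is easier to think in terms of prior structures by making multivariate
normal assumptions about the logits $\gamma_1, \ldots, \gamma_s$ satisfying

$$ \theta_j = e^{\gamma_j} \Big/ \sum_{g=1}^{s} e^{\gamma_g} \quad (j = 1, \ldots, s). \tag{3.1} $$

In particular we assume that the log-contrasts

$$ \xi_j = \gamma_j - \gamma_{j+1} \quad (j = 1, \ldots, s-1) \tag{3.2} $$

possess prior covariance structure

$$ \operatorname{cov}(\xi_j, \xi_k) = \sigma^2 \rho^{|j-k|} \quad (0 < \sigma^2 < \infty ;\ |\rho| < 1). \tag{3.3} $$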

We originally assumed the structure in (3.3) for the logits themselves,
but when considering the continuous case (see Leonard, 1978), it became
apparent that, by taking differences first, a more reasonable smoothing
(avoiding too much flattening) of the $\theta$'s would result. The
hyperparameter $\rho$ measures the degree of smoothness of the hypothetical
true density of the raw observations, and $\sigma^2$ measures the
'closeness' to the 'null hypothesis' discussed below.

We hence assume that the vector $\gamma$ possesses a multivariate normal
prior distribution

$$ \gamma \mid \mu, \sigma^2, \rho \sim N(\mu, C) \tag{3.4} $$

where

$$ C = B A B^T \tag{3.5} $$

with the $(j,k)$th element of the $(s-1) \times (s-1)$ matrix $A$ equal to
the quantity in (3.3) and

$$ B = \begin{pmatrix} 0 & 0 & 0 & \cdots & 0 \\ 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & 1 \end{pmatrix} . \tag{3.6} $$

The choice of the prior vector $\mu$ may be based upon the statistician's
'null hypothesis' about the probabilities in the histogram. For example, if
his hypothesis is that the underlying density of the raw observations is
$f_0(t)$ for $t \in (a,b)$ then he should set

$$ \mu_j = \log \Big[ \int_{I_j} f_0(t)\, dt \Big] \quad \text{for } j = 1, \ldots, s. \tag{3.7} $$

If any unknown parameters appear in his choice of $f_0(t)$ then we suggest
that these should be estimated from the data by conventional techniques. The
shrinkage parameter $\sigma^2$ plays the role of the significance level in
the standard chi-square goodness of fit test.

In this situation we of course do not have exchangeability of the
$\gamma_j$ or the $\xi_j$. However, the second order lagged differences

$$ (\gamma_{j+2} - \mu_{j+2} - \gamma_{j+1} + \mu_{j+1}) - \rho\, (\gamma_{j+1} - \mu_{j+1} - \gamma_j + \mu_j) $$

are exchangeable for $j = 1, \ldots, s-2$. This ties in with the philosophy
outlined in Section 1 of seeking appropriate transformations of the
parameters which are exchangeable. The lagged differences create an
autoregressive type smoothing on the $\theta$'s so that the posterior
estimates of $\theta_j$ will take account of estimates in adjacent intervals.
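
A short numerical check of the exchangeability claim, assuming the
covariance structure $\sigma^2 \rho^{|j-k|}$ for the centred log-contrasts
$u_j = \xi_j - \eta_j$ with $\eta_j = \mu_j - \mu_{j+1}$: the lagged
differences $e_j = u_{j+1} - \rho u_j$ (equal, up to sign, to the second
order differences above) then have a common variance and zero covariances,
which under joint normality makes them independent and identically
distributed. The helper names `ar_cov` and `lagged_diff_cov` and the
numerical values are hypothetical.

```python
def ar_cov(m, sigma2, rho):
    """m x m matrix with (j, k) element sigma2 * rho**|j - k|."""
    return [[sigma2 * rho ** abs(j - k) for k in range(m)] for j in range(m)]


def lagged_diff_cov(A, rho):
    """Covariance matrix of e_j = u_{j+1} - rho * u_j when cov(u) = A."""
    m = len(A)
    return [[A[j + 1][k + 1] - rho * A[j + 1][k] - rho * A[j][k + 1]
             + rho * rho * A[j][k]
             for k in range(m - 1)] for j in range(m - 1)]


A = ar_cov(5, sigma2=2.0, rho=0.6)
E = lagged_diff_cov(A, rho=0.6)
for row in E:
    print([round(v, 10) for v in row])
```

The diagonal elements all equal $\sigma^2 (1 - \rho^2)$ and the off-diagonal
elements vanish, so only the two hyperparameters $\sigma^2$ and $\rho$ are
needed to describe the prior dependence between adjacent cells.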
When $\mu$, $\sigma^2$, and $\rho$ are specified, we have the approximate
posterior distribution

$$ \gamma \mid x, \mu, \sigma^2, \rho \sim N(\gamma^*, D) $$

where $\gamma^*$ is the posterior mode vector, and satisfies the non-linear
system

$$ n \theta^* = x - C^{-1} (\gamma^* - \mu) \tag{3.9} $$

where $\theta^* = (\theta_1^*, \ldots, \theta_s^*)^T$, with $\theta_j^* =
e^{\gamma_j^*} / \sum_g e^{\gamma_g^*}$, and

$$ D^{-1} = n [\operatorname{diag}(\theta_1^*, \ldots, \theta_s^*) - \theta^* \theta^{*T}] + C^{-1} \tag{3.10} $$

denotes the posterior information matrix.
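
The non-linear system (3.9) lends itself to Newton-Raphson iteration, with
(3.10) supplying the matrix of second derivatives. The following is a
minimal sketch rather than the author's implementation: it takes the prior
precision matrix $C^{-1}$ as given (here an identity matrix, chosen only for
illustration), uses no step-size safeguards, and the names `solve`,
`softmax` and `posterior_mode` and the cell counts are hypothetical.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def softmax(g):
    """Cell probabilities from logits, as in (3.1)."""
    mx = max(g)
    w = [math.exp(v - mx) for v in g]
    t = sum(w)
    return [v / t for v in w]


def posterior_mode(x, mu, Cinv, iters=50):
    """Newton-Raphson for gamma* in (3.9); Cinv is the prior precision matrix."""
    s, n = len(x), sum(x)
    g = list(mu)  # start at the prior mean
    for _ in range(iters):
        th = softmax(g)
        dev = [gj - mj for gj, mj in zip(g, mu)]
        pen = [sum(Cinv[j][k] * dev[k] for k in range(s)) for j in range(s)]
        grad = [x[j] - n * th[j] - pen[j] for j in range(s)]
        # Posterior information matrix (3.10) at the current iterate.
        H = [[n * ((th[j] if j == k else 0.0) - th[j] * th[k]) + Cinv[j][k]
              for k in range(s)] for j in range(s)]
        step = solve(H, grad)
        g = [gj + dj for gj, dj in zip(g, step)]
    return g, softmax(g)


x = [0, 3, 5, 2]                       # hypothetical cell frequencies
mu = [0.0, 0.0, 0.0, 0.0]              # prior mean of the logits
Cinv = [[1.0 if j == k else 0.0 for k in range(4)] for j in range(4)]
gamma_star, theta_star = posterior_mode(x, mu, Cinv)
print(theta_star)
```

Note that the empty first cell receives a strictly positive smoothed
probability, since the logit for that cell is pulled towards its prior mean
rather than towards minus infinity.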


This system can be solved via Newton-Raphson, and the EM algorithm can be
used to (approximately) estimate $\sigma^2$ and $\sigma^2 \rho$ by

$$ \sigma^2 = s^{-1} \sum_{j=1}^{s-1} (\xi_j^* - \eta_j)^2 + s^{-1} \sum_{j=1}^{s-1} r_{jj} \tag{3.11} $$

and

$$ \sigma^2 \rho = s^{-1} \sum_{j=1}^{s-2} (\xi_j^* - \eta_j)(\xi_{j+1}^* - \eta_{j+1}) + s^{-1} \sum_{j=1}^{s-2} r_{j,j+1} \tag{3.12} $$

where $\xi_j^* = \gamma_j^* - \gamma_{j+1}^*$, $\eta_j = \mu_j - \mu_{j+1}$,
and $r_{jk}$ is the $(j,k)$th element of

$$ R = G D G^T . $$

Cyclic substitutions are again appropriate. See Leonard (1978, p. 115) for a
numerical example.




4. Simultaneous Estimation for Several Multinomial Distributions.

Suppose now that, for $i = 1, \ldots, m$, the elements of the $s \times 1$
vector $x_i$ possess a multinomial distribution with sample size $n_i$, given
the vector of cell probabilities $\theta_i = (\theta_{i1}, \ldots,
\theta_{is})^T$, and that these $m$ multinomial distributions are
independent, given the parameters. We now have $m$ logit vectors $\gamma_i =
(\gamma_{i1}, \ldots, \gamma_{is})^T$ satisfying

$$ \theta_{ij} = e^{\gamma_{ij}} \Big/ \sum_{g=1}^{s} e^{\gamma_{ig}} \quad (i = 1, \ldots, m;\ j = 1, \ldots, s). \tag{4.1} $$

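As in the several regression line situation, it will often be appropriate
to assume exchangeability between the parameter vectors $\theta_1, \ldots,
\theta_m$ rather than between the elements within any particular vector. In
this case we assume

$$ \gamma_i \mid \mu, C \sim IN(\mu, C) \quad (i = 1, \ldots, m) $$

yielding the approximate posterior distribution

$$ \gamma_i \mid x, \mu, C \sim IN(\gamma_i^*, D_i) $$

where

$$ n_i \theta_i^* = x_i - C^{-1} (\gamma_i^* - \mu) \quad (i = 1, \ldots, m) \tag{4.2} $$

with $\theta_i^* = (\theta_{i1}^*, \ldots, \theta_{is}^*)^T$, where
$\theta_{ij}^* = e^{\gamma_{ij}^*} / \sum_g e^{\gamma_{ig}^*}$, and

$$ D_i^{-1} = n_i \{ \operatorname{diag}(\theta_{i1}^*, \ldots, \theta_{is}^*) - \theta_i^* \theta_i^{*T} \} + C^{-1} . \tag{4.3} $$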

As we have replications on $\mu$ and $C$ it is no longer necessary to
assume a specific covariance structure like (3.3) within each multinomial,
since $\mu$ and $C$ can be estimated in their entirety from the data; the EM
algorithm gives

$$ \mu = \bar{\gamma}^* = m^{-1} \sum_{i=1}^{m} \gamma_i^* $$

and

$$ C = m^{-1} \sum_{i=1}^{m} (\gamma_i^* - \bar{\gamma}^*)(\gamma_i^* - \bar{\gamma}^*)^T + m^{-1} \sum_{i=1}^{m} D_i $$

which may be solved iteratively together with (4.2) and (4.3).
In the first six columns of Table 1 we give the percentages of pupils
respectively obtaining grades one to six on a test taken at 40 different
schools. We assumed exchangeability between the schools and obtained the
following estimates for $\mu$, diag($C$), and the correlation matrix $B$
associated with $C$:

$$ \mu = (-0.51,\ 0.41,\ 0.56,\ 0.59,\ -0.82,\ -0.24)^T , $$

with a prior estimate of $(0.087,\ 0.217,\ 0.255,\ 0.262,\ 0.064,\
0.115)^T$ for the $\theta_i$,

$$ \operatorname{diag}(C) = (1.08,\ 0.39,\ 0.25,\ 0.26,\ 0.46,\ 0.92), $$

and

$$ B = \begin{pmatrix}
 1 & 0.79 & 0.57 & -0.07 & -0.38 & -0.57 \\
 0.79 & 1 & 0.79 & 0.29 & -0.14 & -0.31 \\
 0.57 & 0.79 & 1 & 0.61 & 0.21 & -0.01 \\
 -0.07 & 0.29 & 0.61 & 1 & 0.68 & 0.62 \\
 -0.38 & -0.14 & 0.21 & 0.68 & 1 & 0.83 \\
 -0.51 & -0.31 & -0.01 & 0.62 & 0.83 & 1
\end{pmatrix} . $$

Note that the matrix $B$ gives moderately high correlations between
adjacent cells within each school and negative correlations between
$\gamma_{ij}$ and $\gamma_{i,j+k}$ for $|k| > 2$. The between school
exchangeability has helped us to estimate the prior structure within each
school. This is similar in spirit to the autoregressive structure in (3.3)
since it enables us to take account of the ordering of the cells.
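
The prior estimate of the cell probabilities quoted with the estimates of
$\mu$ can be checked by applying the logit transformation (4.1) to the
estimated $\mu$. The sketch below does this; the function name
`logits_to_probs` is hypothetical, and the comparison values are the six
figures reported in the text, which agree to within rounding.

```python
import math


def logits_to_probs(g):
    """Cell probabilities theta_j = exp(g_j) / sum_g exp(g_g), as in (4.1)."""
    mx = max(g)  # subtract the maximum for numerical stability
    w = [math.exp(v - mx) for v in g]
    total = sum(w)
    return [v / total for v in w]


mu_hat = [-0.51, 0.41, 0.56, 0.59, -0.82, -0.24]      # estimate of mu from the text
reported = [0.087, 0.217, 0.255, 0.262, 0.064, 0.115]  # prior estimate from the text
theta_prior = logits_to_probs(mu_hat)
for t, r in zip(theta_prior, reported):
    print(round(t, 3), r)
```

Multiplying these values by 100 gives the common vector of percentages
towards which the observed percentages of each school are shrunk.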

The smoothed percentages in Table 1 smooth the observed percentages (a)
by shrinkages utilizing the collateral information in the common prior
estimate of the cell percentages, and (b) by smoothing within each school,
using the covariance structure estimated via $C$. Note, for example, that the
zeros create no extra problem: their smoothed values are all positive, the
amount depending on collateral and within school information.
TABLE 1: OBSERVED AND SMOOTHED PERCENTAGES

               OBSERVED                              SMOOTHED
 i      1     2     3     4     5     6       1     2     3     4     5     6   n_i

 1    6.7  17.8  24.4  28.9   6.7  15.6     6.6  18.8  24.2  28.2   7.2  15.0    45
 2    0.0  21.6  24.3  18.9  13.5  21.6     3.8  16.9  22.2  26.2   9.9  20.9    37
 3    5.3  15.8  42.1  26.3   5.3   5.3     8.3  21.4  29.1  26.5   5.8   8.9    19
 4   22.2  25.9  29.6  11.1   7.4   3.7    19.5  26.9  26.5  17.9   4.2   4.9    27
 5   16.7  33.3  11.1  16.7   5.6  16.7    12.7  25.4  22.8  22.3   5.7  11.1    18
 6    5.9   7.4  26.5  33.8   8.8  17.6     4.8  12.9  23.8  31.5   8.9  18.0    68
 7   31.7  17.1  14.6  19.5   9.8   7.3    24.1  22.1  21.7  19.1   5.7   7.4    41
 8    4.5   4.5  22.7  31.8   9.1  27.3     3.6  12.4  20.2  29.7   9.6  24.5    22
 9   16.1  45.2  19.4   9.7   3.2   6.4    17.5  33.6  23.8  16.8   3.4   5.0    31
10   13.0  31.5  31.5  24.1   0.0   0.0    14.9  30.3  29.1  20.5   2.4   2.9    54
11   23.5  35.3  20.6  20.6   0.0   0.0    22.7  31.7  24.3  16.5   2.2   2.6    34
12   22.8  26.3  21.1  15.8   3.5  10.5    19.8  26.2  23.0  18.7   4.4   7.8    57
13    7.1  14.2  14.2  32.1  10.7  21.4     5.3  15.7  20.5  29.0   9.0  20.4    28
14   13.9  33.3  25.0  19.4   0.0   8.3    14.2  29.2  25.6  20.9   3.7   6.5    36
15    0.0  18.2  13.6  54.5   9.1   4.5     4.2  17.2  23.2  34.7   7.4  13.4    22
16   12.5  31.3  18.8  18.8   6.3  12.5    11.3  24.8  24.6  23.4   5.7  10.1    16
17    0.0   9.4  25.0  50.0   9.4   6.3     3.3  14.3  24.4  36.3   8.0  13.7    32
18    0.0   5.3  15.8  21.1  26.3  31.6     2.0   9.1  16.1  26.6  13.9  32.3    19
19    0.0   3.7   7.4  14.8   3.7  70.4     0.7   4.6   8.4  20.0   9.4  57.1    27
20    2.4   7.3  19.5  36.6   4.8  29.3     2.7  11.3  18.8  31.9   8.4  27.0    41
21   29.2  16.7   8.3  37.5   4.2   4.2    19.5  23.5  22.8  23.3   4.4   6.5    24
22   14.2  21.4  42.9  21.4   0.0   0.0    14.8  26.4  29.0  21.2   3.7   4.8    14
23    0.0  17.4  34.8  17.4  17.4  13.0     4.8  17.2  25.5  26.8   9.7  15.9    23
24    5.6  19.4  30.1  22.2  11.1  11.1     7.1  19.9  26.5  26.1   7.9  12.5    36
25   31.3  18.8  28.1  15.6   3.1   3.1    25.6  25.3  25.5  16.7   3.2   3.7    32
26    9.5  19.0  42.9  14.3   0.0  14.3    10.1  22.9  28.4  23.5   5.2   9.8    21
27    5.6  37.0  22.2  25.9   3.7   5.6     8.8  30.0  25.4  24.5   4.3   7.1    54
28    0.0  30.8   7.7  30.8   0.0  30.8     4.4  17.7  20.2  28.7   7.4  21.6    13
29    4.0  12.0  16.0  32.0   8.0  28.0     3.7  13.6  19.3  29.5   9.1  24.8    25
30    0.0  16.7  27.8  33.3   5.6  16.7     4.5  17.1  23.9  30.1   7.6  16.8    18
31    3.7  11.1  37.0  37.0   0.0  11.1     5.8  18.0  27.6  30.9   5.8  11.8    27
32    0.0  42.9  28.6  21.4   7.1   0.0     9.6  27.6  27.4  23.6   4.8   6.9    14
33    0.0  14.3  28.6  21.4  21.4  14.3     4.5  15.8  23.2  28.1  10.1  18.2    14
34   19.5  39.0  17.1  22.0   0.0   2.4    19.4  33.0  23.4  18.1   2.5   3.5    41
35   37.9  13.8  37.9   6.9   3.4   0.0    31.8  24.6  26.2  12.8   2.4   2.1    29
36    0.0  18.8   6.2  25.0   6.3  43.8     2.5  11.9  15.9  27.2   9.4  33.1    16
37   13.3  20.0  20.0  26.7   6.7  13.3     9.8  21.6  24.5  25.7   6.4  11.8    15
38   16.7  37.5  41.7   4.2   0.0   0.0    20.3  32.4  28.3  14.3   2.2   2.3    24
39   18.3  31.7  21.7  15.0   6.7   6.7    17.5  28.7  24.0  18.4   4.8   6.5    60
40    0.0  11.1  11.1  33.3  22.2  22.2     3.4  13.3  19.7  29.4  10.5  23.7     9